A7: Kickstarter
Loading Libraries
library("dplyr")
library("ggplot2")
library("plotly")
options(scipen = 999) # to prevent scientific notation display
Loading Data
kickstarter_df <- read.csv("ks-projects-201801.csv")
glimpse(kickstarter_df)
Rows: 378,661
Columns: 15
$ ID <int> 1000002330, 1000003930, 1000004038, 1000007540, 10000…
$ name <chr> "The Songs of Adelaide & Abullah", "Greeting From Ear…
$ category <chr> "Poetry", "Narrative Film", "Narrative Film", "Music"…
$ main_category <chr> "Publishing", "Film & Video", "Film & Video", "Music"…
$ currency <chr> "GBP", "USD", "USD", "USD", "USD", "USD", "USD", "USD…
$ deadline <chr> "2015-10-09", "2017-11-01", "2013-02-26", "2012-04-16…
$ goal <dbl> 1000, 30000, 45000, 5000, 19500, 50000, 1000, 25000, …
$ launched <chr> "2015-08-11 12:12:28", "2017-09-02 04:43:57", "2013-0…
$ pledged <dbl> 0.00, 2421.00, 220.00, 1.00, 1283.00, 52375.00, 1205.…
$ state <chr> "failed", "failed", "failed", "failed", "canceled", "…
$ backers <int> 0, 15, 3, 1, 14, 224, 16, 40, 58, 43, 0, 100, 0, 0, 7…
$ country <chr> "GB", "US", "US", "US", "US", "US", "US", "US", "US",…
$ usd.pledged <dbl> 0.00, 100.00, 220.00, 1.00, 1283.00, 52375.00, 1205.0…
$ usd_pledged_real <dbl> 0.00, 2421.00, 220.00, 1.00, 1283.00, 52375.00, 1205.…
$ usd_goal_real <dbl> 1533.95, 30000.00, 45000.00, 5000.00, 19500.00, 50000…
Processing Data
kickstarter_df <-
  kickstarter_df |>
  mutate(deadline_date = as.Date(deadline),
         launched_date = as.Date(launched))
kickstarter_df <-
  kickstarter_df |>
  mutate(deadline_year = format(deadline_date, "%Y"),
         launched_year = format(launched_date, "%Y"))
glimpse(kickstarter_df)
Rows: 378,661
Columns: 19
$ ID <int> 1000002330, 1000003930, 1000004038, 1000007540, 10000…
$ name <chr> "The Songs of Adelaide & Abullah", "Greeting From Ear…
$ category <chr> "Poetry", "Narrative Film", "Narrative Film", "Music"…
$ main_category <chr> "Publishing", "Film & Video", "Film & Video", "Music"…
$ currency <chr> "GBP", "USD", "USD", "USD", "USD", "USD", "USD", "USD…
$ deadline <chr> "2015-10-09", "2017-11-01", "2013-02-26", "2012-04-16…
$ goal <dbl> 1000, 30000, 45000, 5000, 19500, 50000, 1000, 25000, …
$ launched <chr> "2015-08-11 12:12:28", "2017-09-02 04:43:57", "2013-0…
$ pledged <dbl> 0.00, 2421.00, 220.00, 1.00, 1283.00, 52375.00, 1205.…
$ state <chr> "failed", "failed", "failed", "failed", "canceled", "…
$ backers <int> 0, 15, 3, 1, 14, 224, 16, 40, 58, 43, 0, 100, 0, 0, 7…
$ country <chr> "GB", "US", "US", "US", "US", "US", "US", "US", "US",…
$ usd.pledged <dbl> 0.00, 100.00, 220.00, 1.00, 1283.00, 52375.00, 1205.0…
$ usd_pledged_real <dbl> 0.00, 2421.00, 220.00, 1.00, 1283.00, 52375.00, 1205.…
$ usd_goal_real <dbl> 1533.95, 30000.00, 45000.00, 5000.00, 19500.00, 50000…
$ deadline_date <date> 2015-10-09, 2017-11-01, 2013-02-26, 2012-04-16, 2015…
$ launched_date <date> 2015-08-11, 2017-09-02, 2013-01-12, 2012-03-17, 2015…
$ deadline_year <chr> "2015", "2017", "2013", "2012", "2015", "2016", "2014…
$ launched_year <chr> "2015", "2017", "2013", "2012", "2015", "2016", "2014…
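As a small illustration of the conversion above, the following sketch applies the same two steps to two timestamp strings in the layout of the ‘launched’ column. The object names are hypothetical, and the final as.integer() step is an optional addition that is not used in this report; it is shown only because format() returns the year as text rather than as a number.

```r
# Example timestamps in the same layout as the 'launched' column
launched_example <- c("2015-08-11 12:12:28", "2017-09-02 04:43:57")

# as.Date() keeps the calendar date and drops the time of day
launched_date_example <- as.Date(launched_example)

# format() returns the year as text (character), not as a number
launched_year_example <- format(launched_date_example, "%Y")

# Optional (assumption, not used in the report): a numeric year
launched_year_num <- as.integer(launched_year_example)

print(launched_date_example)
print(class(launched_year_example))
print(launched_year_num)
```

Keeping the year as text is enough for grouping and for a discrete plot axis; a numeric year would mainly matter for arithmetic or a continuous axis.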
Initial Calculations
num_total_projects <-
  kickstarter_df |>
  nrow()
paste(num_total_projects, "total projects")
[1] "378661 total projects"
num_successful_projects <-
  kickstarter_df |>
  filter(state == "successful") |>
  nrow()
paste(num_successful_projects, "successful projects")
[1] "133956 successful projects"
num_failed_projects <-
  kickstarter_df |>
  filter(state == "failed") |>
  nrow()
paste(num_failed_projects, "failed projects")
[1] "197719 failed projects"
percent_failed_projects <-
  round((num_failed_projects / num_total_projects) * 100, 2)
paste0(percent_failed_projects, "% of projects were failed")
[1] "52.22% of projects were failed"
The dataset contains 378,661 Kickstarter projects in total. Of these, 133,956 were successful and 197,719 failed, which means that 52.22% of all projects failed. The remaining projects are in other states (for example, canceled or live).
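The percentage above uses all projects as the denominator. As a rough check, the sketch below redoes that arithmetic from the three counts reported above and, as an alternative that is not used in this report, also divides by only the projects that ended as successful or failed. The variable names are hypothetical.

```r
# Counts copied from the output above
total_projects      <- 378661
successful_projects <- 133956
failed_projects     <- 197719

# Share of failures among all projects (the definition used in this report)
pct_failed_of_all <- round(failed_projects / total_projects * 100, 2)

# Alternative (assumption): share of failures among projects that
# ended as either successful or failed
decided_projects      <- successful_projects + failed_projects
pct_failed_of_decided <- round(failed_projects / decided_projects * 100, 2)

print(pct_failed_of_all)
print(pct_failed_of_decided)
```

The second figure is necessarily higher than the first, because the denominator leaves out projects in every other state.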
Biggest Non-Success
biggest_non_success_project <-
  kickstarter_df |>
  filter(state != "successful") |>
  filter(usd_pledged_real == max(usd_pledged_real))
biggest_non_success_project_name <-
  biggest_non_success_project |>
  pull(name)
biggest_non_success_project_goal <-
  biggest_non_success_project |>
  pull(usd_goal_real)
biggest_non_success_project_pledged <-
  biggest_non_success_project |>
  pull(usd_pledged_real)
The biggest non-success project on Kickstarter was The Skarp Laser Razor: 21st Century Shaving (Suspended). Its funding goal was $160,000 USD, but it actually received $4,005,111.42 in pledges. The project was based in Irvine, California and proposed an alternative to laser hair treatment that targets the chromophores in hair rather than the hair follicles. It gained more than $4 million in pledges in less than three weeks, but Kickstarter suspended it in October 2015 because Skarp allegedly violated Kickstarter’s rules by failing to provide a working prototype. Despite some efforts to revive the product launch, the razor has still not been released.
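The lookup above filters on the maximum pledged amount. As an alternative sketch, dplyr’s slice_max() expresses the same idea in one step. The toy data frame below is made up for illustration and only mirrors the column names of the Kickstarter data.

```r
library(dplyr)

# Toy data with made-up values; column names mirror the Kickstarter data
toy_projects <- tibble(
  name = c("A", "B", "C", "D"),
  state = c("successful", "failed", "suspended", "canceled"),
  usd_pledged_real = c(9000, 1200, 5000, 300)
)

# Keep non-successful projects, then take the row with the largest pledge
biggest_non_success_toy <-
  toy_projects |>
  filter(state != "successful") |>
  slice_max(usd_pledged_real, n = 1)

print(biggest_non_success_toy)
```

By default slice_max() keeps ties, so if two projects shared the maximum, both rows would be returned, just as with the filter-based version.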
Project State
kickstarter_by_state_df <-
  kickstarter_df |>
  group_by(state) |>
  summarize(
    number_of_projects = n()
  )
kickstarter_by_state_plot <-
  ggplot(kickstarter_by_state_df) +
  geom_col(aes(
    x = state,
    y = number_of_projects
  )) +
  labs(
    title = "Number of Kickstarter Projects in Each State",
    x = "State",
    y = "Number of Projects"
  )
ggplotly(kickstarter_by_state_plot)
More often than not, Kickstarter projects fail rather than succeed. After failed and successful, the next most common outcome is for a project to be canceled. After that, projects tend to be undefined, live (that is, still in progress), and, least commonly, suspended.
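The ranking of states described above can also be read from a table of counts and percentages. The sketch below shows one way to build such a table with count(), using a made-up toy data frame rather than the real Kickstarter counts; the object and column names other than ‘state’ are hypothetical.

```r
library(dplyr)

# Toy data with made-up states (not the real Kickstarter counts)
toy_states <- tibble(
  state = c("failed", "failed", "failed",
            "successful", "successful", "canceled")
)

# Count projects per state, add each state's share, and sort by count
toy_state_shares <-
  toy_states |>
  count(state, name = "number_of_projects") |>
  mutate(
    percent_of_projects =
      round(number_of_projects / sum(number_of_projects) * 100, 2)
  ) |>
  arrange(desc(number_of_projects))

print(toy_state_shares)
```

If the bars in the plot should follow the same order, one option is to map x to reorder(state, -number_of_projects) inside aes().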
Yearly Summary
yearly_summary_df <-
  kickstarter_df |>
  group_by(launched_year) |>
  summarize(
    count = n(),
    num_successful_projects = sum(state == "successful", na.rm = TRUE),
    percent_success = round((num_successful_projects / count) * 100, 2),
    mean_pledged = mean(usd_pledged_real, na.rm = TRUE)
  )
print(yearly_summary_df)
# A tibble: 11 × 5
   launched_year count num_successful_projects percent_success mean_pledged
   <chr>         <int>                   <int>           <dbl>        <dbl>
 1 1970              7                       0             0             0
 2 2009           1329                     579            43.6        2141.
 3 2010          10519                    4593            43.7        2800.
 4 2011          26237                   12171            46.4        3954.
 5 2012          41165                   17892            43.5        7833.
 6 2013          44851                   19415            43.3       10670.
 7 2014          67745                   21107            31.2        7744.
 8 2015          77300                   20971            27.1        8895.
 9 2016          57184                   18766            32.8       11488.
10 2017          52200                   18462            35.4       11954.
11 2018            124                       0             0           510.
I chose to use the column ‘launched_year’ to denote the year of each project because a launch date is always in the present or the past, whereas a deadline can lie in the future. Even though this particular dataset does not contain projects with future deadlines (for example, ongoing projects with a deadline in 2026), using ‘launched_year’ establishes a consistent standard that would also work for datasets where deadlines extend into the future. This makes the analysis more robust and easier to interpret.
For ‘percent_success’, I calculated the percentage of projects labelled “successful” out of all projects launched in that year. I chose this definition because the outcome that matters most is whether a project succeeds or not. Filtering out the states other than ‘successful’ and ‘failed’ (e.g. ‘live’, ‘canceled’, ‘undefined’, ‘suspended’) before computing the percentage is an unnecessary step in my opinion.
For the summary value about pledges, I chose to compute the mean because I think it best captures the trend of pledges across the years. The mean is computed from every value, whereas min and max each report a single observation and the median reports only the middle of the distribution. Relying on one of those could give misleading insights. For example, if 2016 had a minimum of $200 while 2017 had a minimum of $10000, someone might conclude that pledges in 2017 were higher. In reality, 2016 could have a mean pledge of $1909384 while 2017 has a mean pledge of $94587. Metrics like min, max, and median do not reflect the overall amount pledged as well as the mean does. One caveat is that the mean is sensitive to a few very large projects.
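To make the comparison between summary statistics concrete, the sketch below computes the mean and the median of a small set of made-up pledge totals, four small projects and one very large one. The numbers and object names are hypothetical and are not taken from the dataset.

```r
# Made-up pledge totals: four small projects and one very large one
toy_pledges <- c(10, 20, 30, 40, 4000000)

# The mean uses every value, so the large project moves it a lot
toy_mean <- mean(toy_pledges)

# The median only reports the middle value
toy_median <- median(toy_pledges)

print(toy_mean)
print(toy_median)
```

Which statistic is preferable depends on the question: the mean tracks the total money pledged per project, while the median describes what a typical project receives.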
yearly_summary_plot <-
  ggplot(yearly_summary_df) +
  geom_col(aes(
    x = launched_year,
    y = mean_pledged
  )) +
  labs(
    title = "Average Pledge Value of Kickstarter Projects over the Years",
    x = "Year",
    y = "Average Pledge Value in USD"
  )
ggplotly(yearly_summary_plot)
After Kickstarter launched in 2009, the average pledge value in USD per project increased every year through 2013. After that peak of approximately $10,670, the average dropped to approximately $7,744 in 2014 and then rose again each year through 2017. The average then dropped dramatically from approximately $11,954 in 2017 to approximately $510 in 2018. A likely reason is that the data was collected in January 2018 (the file is named ks-projects-201801), so the 124 projects launched in 2018 had only just started collecting pledges, and none had yet been recorded as successful.
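Years with very few projects (1970 and 2018 in the yearly summary) can distort the plot. One possible way to handle this, sketched below, is to drop years under a minimum project count before plotting. The threshold of 1,000 projects and the object names are assumptions chosen for illustration; the rows are a subset copied from the yearly summary table.

```r
library(dplyr)

# Subset of the yearly summary printed earlier (values copied from that table)
yearly_subset <- tibble(
  launched_year = c("1970", "2016", "2017", "2018"),
  count = c(7L, 57184L, 52200L, 124L),
  mean_pledged = c(0, 11488, 11954, 510)
)

# Assumption: treat years with fewer than 1,000 projects as incomplete
min_projects_per_year <- 1000

complete_years <-
  yearly_subset |>
  filter(count >= min_projects_per_year)

print(complete_years)
```

This keeps the original summary intact and only changes what is passed to the plot, so the excluded years remain available for inspection.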
Unusual Values
year_1970_df <-
  kickstarter_df |>
  filter(launched_year == "1970")
glimpse(year_1970_df)
Rows: 7
Columns: 19
$ ID <int> 1014746686, 1245461087, 1384087152, 1480763647, 33094…
$ name <chr> "Salt of the Earth: A Dead Sea Movie (Canceled)", "1s…
$ category <chr> "Film & Video", "Art", "Film & Video", "Theater", "Mu…
$ main_category <chr> "Film & Video", "Art", "Film & Video", "Theater", "Mu…
$ currency <chr> "USD", "USD", "USD", "USD", "USD", "USD", "CHF"
$ deadline <chr> "2010-09-15", "2010-08-14", "2010-05-21", "2010-06-01…
$ goal <dbl> 5000, 15000, 700, 4000, 10000, 500, 1900
$ launched <chr> "1970-01-01 01:00:00", "1970-01-01 01:00:00", "1970-0…
$ pledged <dbl> 0, 0, 0, 0, 0, 0, 0
$ state <chr> "canceled", "canceled", "canceled", "canceled", "canc…
$ backers <int> 0, 0, 0, 0, 0, 0, 0
$ country <chr> "US", "US", "US", "US", "US", "US", "CH"
$ usd.pledged <dbl> 0, 0, 0, 0, 0, 0, 0
$ usd_pledged_real <dbl> 0, 0, 0, 0, 0, 0, 0
$ usd_goal_real <dbl> 5000.00, 15000.00, 700.00, 4000.00, 10000.00, 500.00,…
$ deadline_date <date> 2010-09-15, 2010-08-14, 2010-05-21, 2010-06-01, 2010-…
$ launched_date <date> 1970-01-01, 1970-01-01, 1970-01-01, 1970-01-01, 1970-…
$ deadline_year <chr> "2010", "2010", "2010", "2010", "2010", "2010", "2015"
$ launched_year <chr> "1970", "1970", "1970", "1970", "1970", "1970", "1970"
One unusual value is that 7 projects have a launch date of January 1st, 1970, even though their deadlines fall in 2010 (six projects) or 2015 (one project). Their funding goals ranged from $500 to $15,000, but none of them received any money (i.e. ‘usd_pledged_real’ = 0 for all 7 projects), and they were all canceled.
These values are unusual because Kickstarter only launched in 2009, so no project in the dataset could really have been launched in 1970.
Rather than representing a project actually launched in 1970, “1970” probably marks a missing or invalid date. January 1st, 1970 is the Unix epoch, the zero point that many systems count timestamps from, so a launch timestamp that was missing or stored as 0 when the dataset was created would be displayed as this date. This would explain why all 7 projects share the same impossible launch timestamp. It may also be related to the fact that they all have zero pledged dollars and are recorded as canceled, although the data alone does not confirm this.
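The placeholder explanation can be illustrated with a short sketch. In R, a timestamp of 0 seconds corresponds to midnight on January 1st, 1970 in UTC, and one hour later is 01:00:00, the time shown in the ‘launched’ column for these projects. This is consistent with a zero timestamp displayed in a UTC+1 time zone, but it is not proof of how the dataset was built. The cutoff date and object names below are assumptions for illustration.

```r
# Second 0 of Unix time, displayed in UTC
epoch_utc <- as.POSIXct(0, origin = "1970-01-01", tz = "UTC")
print(format(epoch_utc, "%Y-%m-%d %H:%M:%S"))

# One hour later, which is what a UTC+1 clock would show at second 0
print(format(epoch_utc + 3600, "%Y-%m-%d %H:%M:%S"))

# Hypothetical check: flag launch dates before an assumed cutoff
launched_date_example <- as.Date(c("1970-01-01", "2015-08-11"))
assumed_cutoff <- as.Date("2009-01-01")  # assumption: no real launches before 2009
is_suspect <- launched_date_example < assumed_cutoff
print(is_suspect)
```

A flag like this could be used to exclude the placeholder rows from yearly summaries, or to fall back on the deadline year for those projects.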
A follow-up question I would want to ask the dataset creators for these values is: How were missing or invalid launch dates handled during data cleaning, and does a launch year of 1970 indicate a default or placeholder value rather than a real project launch date?